{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/rag/openai_text_3_emebdding.ipynb)\n",
        "\n",
        "[![View Article](https://img.shields.io/badge/View%20Article-blue)](https://www.mongodb.com/developer/products/atlas/using-openai-latest-embeddings-rag-system-mongodb/)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NdPJf8kP95h-"
      },
      "source": [
        "# Using OpenAI Latest Embeddings In A RAG System With MongoDB\n",
        "\n",
        "OpenAI recently released new embeddings and moderation models. This article explores the step-by-step implementation process of utilizing one of the new embedding models: text-embedding-3-small within a Retrieval Augmented Generation(RAG) System powered by MongoDB Atlas Vector Database.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IlCxoKZD-K_R"
      },
      "source": [
        "## Step 1: Libraries Installation\n",
        "\n",
        "\n",
        "Below are brief explanations of the tools and libraries utilised within the implementation code:\n",
        "* **datasets**: This library is part of the Hugging Face ecosystem. By installing 'datasets', we gain access to a number of pre-processed and ready-to-use datasets, which are essential for training and fine-tuning machine learning models or benchmarking their performance.\n",
        "\n",
        "* **pandas**: A data science library that provides robust data structures and methods for data manipulation, processing and analysis.\n",
        "\n",
        "* **openai**: This is the official Python client library for accessing OpenAI's suite of AI models and tools, including GPT and embedding models.italicised text\n",
        "\n",
        "* **pymongo**: PyMongo is a Python toolkit for MongoDB. It enables interactions with a MongoDB database."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "BH4Khn7Vg7Fj",
        "outputId": "9150d806-de72-4bbd-8823-fbd9dc230f5e"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Collecting datasets\n",
            "  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m507.1/507.1 kB\u001b[0m \u001b[31m5.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)\n",
            "Collecting openai\n",
            "  Downloading openai-1.10.0-py3-none-any.whl (225 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m225.1/225.1 kB\u001b[0m \u001b[31m24.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting pymongo\n",
            "  Downloading pymongo-4.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (677 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m677.1/677.1 kB\u001b[0m \u001b[31m39.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.13.1)\n",
            "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.23.5)\n",
            "Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (10.0.1)\n",
            "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets) (0.6)\n",
            "Collecting dill<0.3.8,>=0.3.0 (from datasets)\n",
            "  Downloading dill-0.3.7-py3-none-any.whl (115 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m12.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.31.0)\n",
            "Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.1)\n",
            "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.4.1)\n",
            "Collecting multiprocess (from datasets)\n",
            "  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m14.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)\n",
            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.9.1)\n",
            "Requirement already satisfied: huggingface-hub>=0.19.4 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.20.3)\n",
            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (23.2)\n",
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)\n",
            "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.3.post1)\n",
            "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai) (3.7.1)\n",
            "Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai) (1.7.0)\n",
            "Collecting httpx<1,>=0.23.0 (from openai)\n",
            "  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.9/75.9 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: pydantic<3,>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from openai) (1.10.14)\n",
            "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai) (1.3.0)\n",
            "Collecting typing-extensions<5,>=4.7 (from openai)\n",
            "  Downloading typing_extensions-4.9.0-py3-none-any.whl (32 kB)\n",
            "Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)\n",
            "  Downloading dnspython-2.5.0-py3-none-any.whl (305 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m305.4/305.4 kB\u001b[0m \u001b[31m28.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: idna>=2.8 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai) (3.6)\n",
            "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai) (1.2.0)\n",
            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.2.0)\n",
            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.4)\n",
            "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)\n",
            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)\n",
            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n",
            "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n",
            "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai) (2023.11.17)\n",
            "Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)\n",
            "  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m76.9/76.9 kB\u001b[0m \u001b[31m9.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)\n",
            "  Downloading h11-0.14.0-py3-none-any.whl (58 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m3.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.3.2)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2.0.7)\n",
            "INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.\n",
            "Collecting multiprocess (from datasets)\n",
            "  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m14.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hInstalling collected packages: typing-extensions, h11, dnspython, dill, pymongo, multiprocess, httpcore, httpx, openai, datasets\n",
            "  Attempting uninstall: typing-extensions\n",
            "    Found existing installation: typing_extensions 4.5.0\n",
            "    Uninstalling typing_extensions-4.5.0:\n",
            "      Successfully uninstalled typing_extensions-4.5.0\n",
            "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
            "llmx 0.0.15a0 requires cohere, which is not installed.\n",
            "llmx 0.0.15a0 requires tiktoken, which is not installed.\n",
            "tensorflow-probability 0.22.0 requires typing-extensions<4.6.0, but you have typing-extensions 4.9.0 which is incompatible.\u001b[0m\u001b[31m\n",
            "\u001b[0mSuccessfully installed datasets-2.16.1 dill-0.3.7 dnspython-2.5.0 h11-0.14.0 httpcore-1.0.2 httpx-0.26.0 multiprocess-0.70.15 openai-1.10.0 pymongo-4.6.1 typing-extensions-4.9.0\n"
          ]
        }
      ],
      "source": [
        "!pip install datasets pandas openai pymongo"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cLOUQGGi-bJb"
      },
      "source": [
        "## Step 2: Data Loading\n",
        "\n",
        "\n",
        "Load the dataset titled [\"AIatMongoDB/embedded_movies\"](https://huggingface.co/datasets/AIatMongoDB/embedded_movies). This dataset is a collection of movie-related details that include attributes such as the title, release year, cast, plot and more. A unique feature of this dataset is the plot_embedding field for each movie. These embeddings are generated using OpenAI's text-embedding-ada-002 model.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 947,
          "referenced_widgets": [
            "ea04b37a21324337bec7cb20dea86f56",
            "2fe30671551b45719066821912e67cc6",
            "39a1be31b49e4a42819a14f792952b9f",
            "bffa8ac928a94aa19b3506327879ee05",
            "0d8907102d4a4beba76c057296d52bcf",
            "13e7387663b840efbabbacc7b02ecbf1",
            "6417cd7c833040d39e9cc4286eefe9aa",
            "6b134ba7d60d48999284332650a85785",
            "55f8d507700c4d03b38d861148f24fea",
            "182fd07b7bd040c4a91cd8da4cd7062f",
            "5218f9ff0f064fb0832ddb7308922082",
            "b956bc06fe3849deaa765f8ec4890635",
            "41cf9f5ecbe1417da972dcdd6b765741",
            "1a81f5d565ca479b85aa096a3a0db260",
            "35072030d3d64381b1ad1920bdf7487d",
            "12e551e39fd04cb489c67125ba4a9587",
            "c0a118cd37284c45bf29b19c8f5bc8f5",
            "e99edb4b89bc48e29c0086f8be068c14",
            "9c4d78299d064b9fb5d062b197be73f5",
            "130cc31ff46f445e9f7e607f8d8aca45",
            "26e5a2e0ef0745e7a1278272cbf478b1",
            "895e63dd0a994b2a98fd2279f2fcd488",
            "41c2a3273bf842848ed271fbfd450a63",
            "d13f0c1006b44683bb475ae56a88cf67",
            "df79edc3b5ed4454840acc5dbf11459d",
            "681ac17d999f473b897d361a8b2d3082",
            "ac331b35738c4557b6bbf3b38e901711",
            "d068109b4d3e44639b83487d81894bb9",
            "70ae19241fab4ac295a39c4499910a97",
            "7d1087206fb5440eaa64f802a976029d",
            "6ff3a4b06afd46018f7c595b575ecd1d",
            "15bf28adec2942c58e1101068abbe5cc",
            "b3937e60c4cc4831b6e18e65d6a2a4ce"
          ]
        },
        "id": "o9MvqN3zb9U6",
        "outputId": "d4b34f3c-2099-4d23-f262-77a16bb30721"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: \n",
            "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
            "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
            "You will be able to reuse this secret in all of your notebooks.\n",
            "Please note that authentication is recommended but still optional to access public models or datasets.\n",
            "  warnings.warn(\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "ea04b37a21324337bec7cb20dea86f56",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading readme:   0%|          | 0.00/2.71k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "b956bc06fe3849deaa765f8ec4890635",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading data:   0%|          | 0.00/42.3M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "41c2a3273bf842848ed271fbfd450a63",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Generating train split: 0 examples [00:00, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "\n",
              "  <div id=\"df-3d30de6c-7d66-494a-91e3-ce98214416c0\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>writers</th>\n",
              "      <th>cast</th>\n",
              "      <th>plot</th>\n",
              "      <th>countries</th>\n",
              "      <th>directors</th>\n",
              "      <th>poster</th>\n",
              "      <th>genres</th>\n",
              "      <th>imdb</th>\n",
              "      <th>num_mflix_comments</th>\n",
              "      <th>runtime</th>\n",
              "      <th>fullplot</th>\n",
              "      <th>languages</th>\n",
              "      <th>title</th>\n",
              "      <th>awards</th>\n",
              "      <th>type</th>\n",
              "      <th>plot_embedding</th>\n",
              "      <th>rated</th>\n",
              "      <th>metacritic</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>[Charles W. Goddard (screenplay), Basil Dickey...</td>\n",
              "      <td>[Pearl White, Crane Wilbur, Paul Panzer, Edwar...</td>\n",
              "      <td>Young Pauline is left a lot of money when her ...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Louis J. Gasnier, Donald MacKenzie]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BMzgxOD...</td>\n",
              "      <td>[Action]</td>\n",
              "      <td>{'id': 4465, 'rating': 7.6, 'votes': 744}</td>\n",
              "      <td>0</td>\n",
              "      <td>199.0</td>\n",
              "      <td>Young Pauline is left a lot of money when her ...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>The Perils of Pauline</td>\n",
              "      <td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td>\n",
              "      <td>movie</td>\n",
              "      <td>[0.00072939653, -0.026834568, 0.013515796, -0....</td>\n",
              "      <td>None</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>[H.M. Walker (titles)]</td>\n",
              "      <td>[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...</td>\n",
              "      <td>A penniless young man tries to save an heiress...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Alfred J. Goulding, Hal Roach]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BNzE1OW...</td>\n",
              "      <td>[Comedy, Short, Action]</td>\n",
              "      <td>{'id': 10146, 'rating': 7.0, 'votes': 639}</td>\n",
              "      <td>0</td>\n",
              "      <td>22.0</td>\n",
              "      <td>As a penniless man worries about how he will m...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>From Hand to Mouth</td>\n",
              "      <td>{'nominations': 1, 'text': '1 nomination.', 'w...</td>\n",
              "      <td>movie</td>\n",
              "      <td>[-0.022837115, -0.022941574, 0.014937485, -0.0...</td>\n",
              "      <td>TV-G</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>[Herbert Brenon (adaptation), John Russell (ad...</td>\n",
              "      <td>[Ronald Colman, Neil Hamilton, Ralph Forbes, A...</td>\n",
              "      <td>Michael \"Beau\" Geste leaves England in disgrac...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Herbert Brenon]</td>\n",
              "      <td>None</td>\n",
              "      <td>[Action, Adventure, Drama]</td>\n",
              "      <td>{'id': 16634, 'rating': 6.9, 'votes': 222}</td>\n",
              "      <td>0</td>\n",
              "      <td>101.0</td>\n",
              "      <td>Michael \"Beau\" Geste leaves England in disgrac...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>Beau Geste</td>\n",
              "      <td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td>\n",
              "      <td>movie</td>\n",
              "      <td>[0.00023330493, -0.028511643, 0.014653289, -0....</td>\n",
              "      <td>None</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>[Douglas Fairbanks (story), Jack Cunningham (a...</td>\n",
              "      <td>[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...</td>\n",
              "      <td>Seeking revenge, an athletic young man joins t...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Albert Parker]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BMzU0ND...</td>\n",
              "      <td>[Adventure, Action]</td>\n",
              "      <td>{'id': 16654, 'rating': 7.2, 'votes': 1146}</td>\n",
              "      <td>1</td>\n",
              "      <td>88.0</td>\n",
              "      <td>A nobleman vows to avenge the death of his fat...</td>\n",
              "      <td>None</td>\n",
              "      <td>The Black Pirate</td>\n",
              "      <td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td>\n",
              "      <td>movie</td>\n",
              "      <td>[-0.005927917, -0.033394486, 0.0015323418, -0....</td>\n",
              "      <td>None</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>[Ted Wilde (story), John Grey (story), Clyde B...</td>\n",
              "      <td>[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...</td>\n",
              "      <td>An irresponsible young millionaire changes his...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Sam Taylor]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BMTcxMT...</td>\n",
              "      <td>[Action, Comedy, Romance]</td>\n",
              "      <td>{'id': 16895, 'rating': 7.6, 'votes': 918}</td>\n",
              "      <td>0</td>\n",
              "      <td>58.0</td>\n",
              "      <td>The Uptown Boy, J. Harold Manners (Lloyd) is a...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>For Heaven's Sake</td>\n",
              "      <td>{'nominations': 1, 'text': '1 nomination.', 'w...</td>\n",
              "      <td>movie</td>\n",
              "      <td>[-0.0059373598, -0.026604708, -0.0070914757, -...</td>\n",
              "      <td>PASSED</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-3d30de6c-7d66-494a-91e3-ce98214416c0')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-3d30de6c-7d66-494a-91e3-ce98214416c0 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-3d30de6c-7d66-494a-91e3-ce98214416c0');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-ef077d03-7ebf-4bcb-a8b0-e4610f3ddc20\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-ef077d03-7ebf-4bcb-a8b0-e4610f3ddc20')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-ef077d03-7ebf-4bcb-a8b0-e4610f3ddc20 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                             writers  \\\n",
              "0  [Charles W. Goddard (screenplay), Basil Dickey...   \n",
              "1                             [H.M. Walker (titles)]   \n",
              "2  [Herbert Brenon (adaptation), John Russell (ad...   \n",
              "3  [Douglas Fairbanks (story), Jack Cunningham (a...   \n",
              "4  [Ted Wilde (story), John Grey (story), Clyde B...   \n",
              "\n",
              "                                                cast  \\\n",
              "0  [Pearl White, Crane Wilbur, Paul Panzer, Edwar...   \n",
              "1  [Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...   \n",
              "2  [Ronald Colman, Neil Hamilton, Ralph Forbes, A...   \n",
              "3  [Billie Dove, Tempe Pigott, Donald Crisp, Sam ...   \n",
              "4  [Harold Lloyd, Jobyna Ralston, Noah Young, Jim...   \n",
              "\n",
              "                                                plot countries  \\\n",
              "0  Young Pauline is left a lot of money when her ...     [USA]   \n",
              "1  A penniless young man tries to save an heiress...     [USA]   \n",
              "2  Michael \"Beau\" Geste leaves England in disgrac...     [USA]   \n",
              "3  Seeking revenge, an athletic young man joins t...     [USA]   \n",
              "4  An irresponsible young millionaire changes his...     [USA]   \n",
              "\n",
              "                              directors  \\\n",
              "0  [Louis J. Gasnier, Donald MacKenzie]   \n",
              "1       [Alfred J. Goulding, Hal Roach]   \n",
              "2                      [Herbert Brenon]   \n",
              "3                       [Albert Parker]   \n",
              "4                          [Sam Taylor]   \n",
              "\n",
              "                                              poster  \\\n",
              "0  https://m.media-amazon.com/images/M/MV5BMzgxOD...   \n",
              "1  https://m.media-amazon.com/images/M/MV5BNzE1OW...   \n",
              "2                                               None   \n",
              "3  https://m.media-amazon.com/images/M/MV5BMzU0ND...   \n",
              "4  https://m.media-amazon.com/images/M/MV5BMTcxMT...   \n",
              "\n",
              "                       genres                                         imdb  \\\n",
              "0                    [Action]    {'id': 4465, 'rating': 7.6, 'votes': 744}   \n",
              "1     [Comedy, Short, Action]   {'id': 10146, 'rating': 7.0, 'votes': 639}   \n",
              "2  [Action, Adventure, Drama]   {'id': 16634, 'rating': 6.9, 'votes': 222}   \n",
              "3         [Adventure, Action]  {'id': 16654, 'rating': 7.2, 'votes': 1146}   \n",
              "4   [Action, Comedy, Romance]   {'id': 16895, 'rating': 7.6, 'votes': 918}   \n",
              "\n",
              "   num_mflix_comments  runtime  \\\n",
              "0                   0    199.0   \n",
              "1                   0     22.0   \n",
              "2                   0    101.0   \n",
              "3                   1     88.0   \n",
              "4                   0     58.0   \n",
              "\n",
              "                                            fullplot  languages  \\\n",
              "0  Young Pauline is left a lot of money when her ...  [English]   \n",
              "1  As a penniless man worries about how he will m...  [English]   \n",
              "2  Michael \"Beau\" Geste leaves England in disgrac...  [English]   \n",
              "3  A nobleman vows to avenge the death of his fat...       None   \n",
              "4  The Uptown Boy, J. Harold Manners (Lloyd) is a...  [English]   \n",
              "\n",
              "                   title                                             awards  \\\n",
              "0  The Perils of Pauline    {'nominations': 0, 'text': '1 win.', 'wins': 1}   \n",
              "1     From Hand to Mouth  {'nominations': 1, 'text': '1 nomination.', 'w...   \n",
              "2             Beau Geste    {'nominations': 0, 'text': '1 win.', 'wins': 1}   \n",
              "3       The Black Pirate    {'nominations': 0, 'text': '1 win.', 'wins': 1}   \n",
              "4      For Heaven's Sake  {'nominations': 1, 'text': '1 nomination.', 'w...   \n",
              "\n",
              "    type                                     plot_embedding   rated  \\\n",
              "0  movie  [0.00072939653, -0.026834568, 0.013515796, -0....    None   \n",
              "1  movie  [-0.022837115, -0.022941574, 0.014937485, -0.0...    TV-G   \n",
              "2  movie  [0.00023330493, -0.028511643, 0.014653289, -0....    None   \n",
              "3  movie  [-0.005927917, -0.033394486, 0.0015323418, -0....    None   \n",
              "4  movie  [-0.0059373598, -0.026604708, -0.0070914757, -...  PASSED   \n",
              "\n",
              "   metacritic  \n",
              "0         NaN  \n",
              "1         NaN  \n",
              "2         NaN  \n",
              "3         NaN  \n",
              "4         NaN  "
            ]
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# 1. Load Dataset\n",
        "import pandas as pd\n",
        "from datasets import load_dataset\n",
        "\n",
        "# https://huggingface.co/datasets/MongoDB/embedded_movies\n",
        "dataset = load_dataset(\"MongoDB/embedded_movies\")\n",
        "\n",
        "# Convert the dataset to a pandas dataframe\n",
        "dataset_df = pd.DataFrame(dataset[\"train\"])\n",
        "\n",
        "dataset_df.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WEH92IX5-211"
      },
      "source": [
        "## Step 3: Data Cleaning and Preparation\n",
        "\n",
        "The next step cleans the data and prepares it for the next stage, which creates a new embedding data point using the new OpenAI embedding model.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "YUYfhMDqmEqP",
        "outputId": "37c5888b-0bdc-4ec9-e690-04e11faa4cb0"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Columns: Index(['writers', 'cast', 'plot', 'countries', 'directors', 'poster', 'genres',\n",
            "       'imdb', 'num_mflix_comments', 'runtime', 'fullplot', 'languages',\n",
            "       'title', 'awards', 'type', 'plot_embedding', 'rated', 'metacritic'],\n",
            "      dtype='object')\n",
            "\n",
            "Number of rows and columns: (1500, 18)\n",
            "\n",
            "Basic Statistics for numerical data:\n",
            "       num_mflix_comments      runtime  metacritic\n",
            "count         1500.000000  1485.000000  572.000000\n",
            "mean             6.071333   111.977104   51.646853\n",
            "std             27.378982    42.090386   16.861996\n",
            "min              0.000000     6.000000    9.000000\n",
            "25%              0.000000    96.000000   40.000000\n",
            "50%              0.000000   106.000000   51.000000\n",
            "75%              1.000000   121.000000   63.000000\n",
            "max            158.000000  1256.000000   97.000000\n",
            "\n",
            "Number of missing values in each column:\n",
            "writers                13\n",
            "cast                    1\n",
            "plot                   27\n",
            "countries               0\n",
            "directors              13\n",
            "poster                 89\n",
            "genres                  0\n",
            "imdb                    0\n",
            "num_mflix_comments      0\n",
            "runtime                15\n",
            "fullplot               48\n",
            "languages               1\n",
            "title                   0\n",
            "awards                  0\n",
            "type                    0\n",
            "plot_embedding         28\n",
            "rated                 308\n",
            "metacritic            928\n",
            "dtype: int64\n"
          ]
        }
      ],
      "source": [
        "print(\"Columns:\", dataset_df.columns)\n",
        "print(\"\\nNumber of rows and columns:\", dataset_df.shape)\n",
        "print(\"\\nBasic Statistics for numerical data:\")\n",
        "print(dataset_df.describe())\n",
        "print(\"\\nNumber of missing values in each column:\")\n",
        "print(dataset_df.isnull().sum())"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "FpR-OMSRhTgx",
        "outputId": "211651de-8d17-493e-d918-ab75aa8078ed"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "Number of missing values in each column after removal:\n",
            "writers                13\n",
            "cast                    1\n",
            "plot                    0\n",
            "countries               0\n",
            "directors              13\n",
            "poster                 78\n",
            "genres                  0\n",
            "imdb                    0\n",
            "num_mflix_comments      0\n",
            "runtime                14\n",
            "fullplot               21\n",
            "languages               1\n",
            "title                   0\n",
            "awards                  0\n",
            "type                    0\n",
            "plot_embedding          1\n",
            "rated                 284\n",
            "metacritic            903\n",
            "dtype: int64\n"
          ]
        },
        {
          "data": {
            "text/html": [
              "\n",
              "  <div id=\"df-d21550b6-ed66-4f28-86be-8d796bf5b598\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>writers</th>\n",
              "      <th>cast</th>\n",
              "      <th>plot</th>\n",
              "      <th>countries</th>\n",
              "      <th>directors</th>\n",
              "      <th>poster</th>\n",
              "      <th>genres</th>\n",
              "      <th>imdb</th>\n",
              "      <th>num_mflix_comments</th>\n",
              "      <th>runtime</th>\n",
              "      <th>fullplot</th>\n",
              "      <th>languages</th>\n",
              "      <th>title</th>\n",
              "      <th>awards</th>\n",
              "      <th>type</th>\n",
              "      <th>rated</th>\n",
              "      <th>metacritic</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>[Charles W. Goddard (screenplay), Basil Dickey...</td>\n",
              "      <td>[Pearl White, Crane Wilbur, Paul Panzer, Edwar...</td>\n",
              "      <td>Young Pauline is left a lot of money when her ...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Louis J. Gasnier, Donald MacKenzie]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BMzgxOD...</td>\n",
              "      <td>[Action]</td>\n",
              "      <td>{'id': 4465, 'rating': 7.6, 'votes': 744}</td>\n",
              "      <td>0</td>\n",
              "      <td>199.0</td>\n",
              "      <td>Young Pauline is left a lot of money when her ...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>The Perils of Pauline</td>\n",
              "      <td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td>\n",
              "      <td>movie</td>\n",
              "      <td>None</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>[H.M. Walker (titles)]</td>\n",
              "      <td>[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...</td>\n",
              "      <td>A penniless young man tries to save an heiress...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Alfred J. Goulding, Hal Roach]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BNzE1OW...</td>\n",
              "      <td>[Comedy, Short, Action]</td>\n",
              "      <td>{'id': 10146, 'rating': 7.0, 'votes': 639}</td>\n",
              "      <td>0</td>\n",
              "      <td>22.0</td>\n",
              "      <td>As a penniless man worries about how he will m...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>From Hand to Mouth</td>\n",
              "      <td>{'nominations': 1, 'text': '1 nomination.', 'w...</td>\n",
              "      <td>movie</td>\n",
              "      <td>TV-G</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>[Herbert Brenon (adaptation), John Russell (ad...</td>\n",
              "      <td>[Ronald Colman, Neil Hamilton, Ralph Forbes, A...</td>\n",
              "      <td>Michael \"Beau\" Geste leaves England in disgrac...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Herbert Brenon]</td>\n",
              "      <td>None</td>\n",
              "      <td>[Action, Adventure, Drama]</td>\n",
              "      <td>{'id': 16634, 'rating': 6.9, 'votes': 222}</td>\n",
              "      <td>0</td>\n",
              "      <td>101.0</td>\n",
              "      <td>Michael \"Beau\" Geste leaves England in disgrac...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>Beau Geste</td>\n",
              "      <td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td>\n",
              "      <td>movie</td>\n",
              "      <td>None</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>[Douglas Fairbanks (story), Jack Cunningham (a...</td>\n",
              "      <td>[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...</td>\n",
              "      <td>Seeking revenge, an athletic young man joins t...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Albert Parker]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BMzU0ND...</td>\n",
              "      <td>[Adventure, Action]</td>\n",
              "      <td>{'id': 16654, 'rating': 7.2, 'votes': 1146}</td>\n",
              "      <td>1</td>\n",
              "      <td>88.0</td>\n",
              "      <td>A nobleman vows to avenge the death of his fat...</td>\n",
              "      <td>None</td>\n",
              "      <td>The Black Pirate</td>\n",
              "      <td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td>\n",
              "      <td>movie</td>\n",
              "      <td>None</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>[Ted Wilde (story), John Grey (story), Clyde B...</td>\n",
              "      <td>[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...</td>\n",
              "      <td>An irresponsible young millionaire changes his...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Sam Taylor]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BMTcxMT...</td>\n",
              "      <td>[Action, Comedy, Romance]</td>\n",
              "      <td>{'id': 16895, 'rating': 7.6, 'votes': 918}</td>\n",
              "      <td>0</td>\n",
              "      <td>58.0</td>\n",
              "      <td>The Uptown Boy, J. Harold Manners (Lloyd) is a...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>For Heaven's Sake</td>\n",
              "      <td>{'nominations': 1, 'text': '1 nomination.', 'w...</td>\n",
              "      <td>movie</td>\n",
              "      <td>PASSED</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d21550b6-ed66-4f28-86be-8d796bf5b598')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-d21550b6-ed66-4f28-86be-8d796bf5b598 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-d21550b6-ed66-4f28-86be-8d796bf5b598');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-30ea2d6e-8ad7-46f2-a07e-48f1589b3ee9\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-30ea2d6e-8ad7-46f2-a07e-48f1589b3ee9')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-30ea2d6e-8ad7-46f2-a07e-48f1589b3ee9 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                             writers  \\\n",
              "0  [Charles W. Goddard (screenplay), Basil Dickey...   \n",
              "1                             [H.M. Walker (titles)]   \n",
              "2  [Herbert Brenon (adaptation), John Russell (ad...   \n",
              "3  [Douglas Fairbanks (story), Jack Cunningham (a...   \n",
              "4  [Ted Wilde (story), John Grey (story), Clyde B...   \n",
              "\n",
              "                                                cast  \\\n",
              "0  [Pearl White, Crane Wilbur, Paul Panzer, Edwar...   \n",
              "1  [Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...   \n",
              "2  [Ronald Colman, Neil Hamilton, Ralph Forbes, A...   \n",
              "3  [Billie Dove, Tempe Pigott, Donald Crisp, Sam ...   \n",
              "4  [Harold Lloyd, Jobyna Ralston, Noah Young, Jim...   \n",
              "\n",
              "                                                plot countries  \\\n",
              "0  Young Pauline is left a lot of money when her ...     [USA]   \n",
              "1  A penniless young man tries to save an heiress...     [USA]   \n",
              "2  Michael \"Beau\" Geste leaves England in disgrac...     [USA]   \n",
              "3  Seeking revenge, an athletic young man joins t...     [USA]   \n",
              "4  An irresponsible young millionaire changes his...     [USA]   \n",
              "\n",
              "                              directors  \\\n",
              "0  [Louis J. Gasnier, Donald MacKenzie]   \n",
              "1       [Alfred J. Goulding, Hal Roach]   \n",
              "2                      [Herbert Brenon]   \n",
              "3                       [Albert Parker]   \n",
              "4                          [Sam Taylor]   \n",
              "\n",
              "                                              poster  \\\n",
              "0  https://m.media-amazon.com/images/M/MV5BMzgxOD...   \n",
              "1  https://m.media-amazon.com/images/M/MV5BNzE1OW...   \n",
              "2                                               None   \n",
              "3  https://m.media-amazon.com/images/M/MV5BMzU0ND...   \n",
              "4  https://m.media-amazon.com/images/M/MV5BMTcxMT...   \n",
              "\n",
              "                       genres                                         imdb  \\\n",
              "0                    [Action]    {'id': 4465, 'rating': 7.6, 'votes': 744}   \n",
              "1     [Comedy, Short, Action]   {'id': 10146, 'rating': 7.0, 'votes': 639}   \n",
              "2  [Action, Adventure, Drama]   {'id': 16634, 'rating': 6.9, 'votes': 222}   \n",
              "3         [Adventure, Action]  {'id': 16654, 'rating': 7.2, 'votes': 1146}   \n",
              "4   [Action, Comedy, Romance]   {'id': 16895, 'rating': 7.6, 'votes': 918}   \n",
              "\n",
              "   num_mflix_comments  runtime  \\\n",
              "0                   0    199.0   \n",
              "1                   0     22.0   \n",
              "2                   0    101.0   \n",
              "3                   1     88.0   \n",
              "4                   0     58.0   \n",
              "\n",
              "                                            fullplot  languages  \\\n",
              "0  Young Pauline is left a lot of money when her ...  [English]   \n",
              "1  As a penniless man worries about how he will m...  [English]   \n",
              "2  Michael \"Beau\" Geste leaves England in disgrac...  [English]   \n",
              "3  A nobleman vows to avenge the death of his fat...       None   \n",
              "4  The Uptown Boy, J. Harold Manners (Lloyd) is a...  [English]   \n",
              "\n",
              "                   title                                             awards  \\\n",
              "0  The Perils of Pauline    {'nominations': 0, 'text': '1 win.', 'wins': 1}   \n",
              "1     From Hand to Mouth  {'nominations': 1, 'text': '1 nomination.', 'w...   \n",
              "2             Beau Geste    {'nominations': 0, 'text': '1 win.', 'wins': 1}   \n",
              "3       The Black Pirate    {'nominations': 0, 'text': '1 win.', 'wins': 1}   \n",
              "4      For Heaven's Sake  {'nominations': 1, 'text': '1 nomination.', 'w...   \n",
              "\n",
              "    type   rated  metacritic  \n",
              "0  movie    None         NaN  \n",
              "1  movie    TV-G         NaN  \n",
              "2  movie    None         NaN  \n",
              "3  movie    None         NaN  \n",
              "4  movie  PASSED         NaN  "
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# Remove data point where plot coloumn is missing\n",
        "dataset_df = dataset_df.dropna(subset=[\"plot\"])\n",
        "print(\"\\nNumber of missing values in each column after removal:\")\n",
        "print(dataset_df.isnull().sum())\n",
        "\n",
        "# Remove the plot_embedding from each data point in the dataset as we are going to create new embeddings with the new OpenAI emebedding Model \"text-embedding-3-small\"\n",
        "dataset_df = dataset_df.drop(columns=[\"plot_embedding\"])\n",
        "dataset_df.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MusOoknJ_BNu"
      },
      "source": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vsL_mt0h_BQP"
      },
      "source": [
        "## Step 4: Create embeddings with OpenAI\n",
        "\n",
        "This stage focuses on generating new embeddings using OpenAI's advanced model.\n",
        "This demonstration utilises a Google Colab Notebook, where environment variables are configured explicitly within the notebook's Secret section and accessed using the user data module. In a production environment, the environment variables that store secret keys are usually stored in a '.env' file or equivalent.\n",
        "\n",
        "An [OpenAI API](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key) key is required to ensure the successful completion of this step. More details on OpenAI's embedding models can be found on the official [site](https://platform.openai.com/docs/guides/embeddings).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 747
        },
        "id": "xvMZRlRGc1Gn",
        "outputId": "9f448e0d-7618-4ddc-8dfd-8cec81226511"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "\n",
              "  <div id=\"df-e5d66acf-7845-4c74-a5a4-315e5b2ebb45\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>writers</th>\n",
              "      <th>cast</th>\n",
              "      <th>plot</th>\n",
              "      <th>countries</th>\n",
              "      <th>directors</th>\n",
              "      <th>poster</th>\n",
              "      <th>genres</th>\n",
              "      <th>imdb</th>\n",
              "      <th>num_mflix_comments</th>\n",
              "      <th>runtime</th>\n",
              "      <th>fullplot</th>\n",
              "      <th>languages</th>\n",
              "      <th>title</th>\n",
              "      <th>awards</th>\n",
              "      <th>type</th>\n",
              "      <th>rated</th>\n",
              "      <th>metacritic</th>\n",
              "      <th>plot_embedding_optimised</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>[Charles W. Goddard (screenplay), Basil Dickey...</td>\n",
              "      <td>[Pearl White, Crane Wilbur, Paul Panzer, Edwar...</td>\n",
              "      <td>Young Pauline is left a lot of money when her ...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Louis J. Gasnier, Donald MacKenzie]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BMzgxOD...</td>\n",
              "      <td>[Action]</td>\n",
              "      <td>{'id': 4465, 'rating': 7.6, 'votes': 744}</td>\n",
              "      <td>0</td>\n",
              "      <td>199.0</td>\n",
              "      <td>Young Pauline is left a lot of money when her ...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>The Perils of Pauline</td>\n",
              "      <td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td>\n",
              "      <td>movie</td>\n",
              "      <td>None</td>\n",
              "      <td>NaN</td>\n",
              "      <td>[0.015450738370418549, -0.0037871389649808407,...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>[H.M. Walker (titles)]</td>\n",
              "      <td>[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...</td>\n",
              "      <td>A penniless young man tries to save an heiress...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Alfred J. Goulding, Hal Roach]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BNzE1OW...</td>\n",
              "      <td>[Comedy, Short, Action]</td>\n",
              "      <td>{'id': 10146, 'rating': 7.0, 'votes': 639}</td>\n",
              "      <td>0</td>\n",
              "      <td>22.0</td>\n",
              "      <td>As a penniless man worries about how he will m...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>From Hand to Mouth</td>\n",
              "      <td>{'nominations': 1, 'text': '1 nomination.', 'w...</td>\n",
              "      <td>movie</td>\n",
              "      <td>TV-G</td>\n",
              "      <td>NaN</td>\n",
              "      <td>[-0.024403288960456848, 0.009791935794055462, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>[Herbert Brenon (adaptation), John Russell (ad...</td>\n",
              "      <td>[Ronald Colman, Neil Hamilton, Ralph Forbes, A...</td>\n",
              "      <td>Michael \"Beau\" Geste leaves England in disgrac...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Herbert Brenon]</td>\n",
              "      <td>None</td>\n",
              "      <td>[Action, Adventure, Drama]</td>\n",
              "      <td>{'id': 16634, 'rating': 6.9, 'votes': 222}</td>\n",
              "      <td>0</td>\n",
              "      <td>101.0</td>\n",
              "      <td>Michael \"Beau\" Geste leaves England in disgrac...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>Beau Geste</td>\n",
              "      <td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td>\n",
              "      <td>movie</td>\n",
              "      <td>None</td>\n",
              "      <td>NaN</td>\n",
              "      <td>[-0.0314427949488163, 0.07585323601961136, 0.0...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>[Douglas Fairbanks (story), Jack Cunningham (a...</td>\n",
              "      <td>[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...</td>\n",
              "      <td>Seeking revenge, an athletic young man joins t...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Albert Parker]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BMzU0ND...</td>\n",
              "      <td>[Adventure, Action]</td>\n",
              "      <td>{'id': 16654, 'rating': 7.2, 'votes': 1146}</td>\n",
              "      <td>1</td>\n",
              "      <td>88.0</td>\n",
              "      <td>A nobleman vows to avenge the death of his fat...</td>\n",
              "      <td>None</td>\n",
              "      <td>The Black Pirate</td>\n",
              "      <td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td>\n",
              "      <td>movie</td>\n",
              "      <td>None</td>\n",
              "      <td>NaN</td>\n",
              "      <td>[0.021738531067967415, 0.06848915666341782, 0....</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>[Ted Wilde (story), John Grey (story), Clyde B...</td>\n",
              "      <td>[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...</td>\n",
              "      <td>An irresponsible young millionaire changes his...</td>\n",
              "      <td>[USA]</td>\n",
              "      <td>[Sam Taylor]</td>\n",
              "      <td>https://m.media-amazon.com/images/M/MV5BMTcxMT...</td>\n",
              "      <td>[Action, Comedy, Romance]</td>\n",
              "      <td>{'id': 16895, 'rating': 7.6, 'votes': 918}</td>\n",
              "      <td>0</td>\n",
              "      <td>58.0</td>\n",
              "      <td>The Uptown Boy, J. Harold Manners (Lloyd) is a...</td>\n",
              "      <td>[English]</td>\n",
              "      <td>For Heaven's Sake</td>\n",
              "      <td>{'nominations': 1, 'text': '1 nomination.', 'w...</td>\n",
              "      <td>movie</td>\n",
              "      <td>PASSED</td>\n",
              "      <td>NaN</td>\n",
              "      <td>[0.008178990334272385, -0.019387762993574142, ...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-e5d66acf-7845-4c74-a5a4-315e5b2ebb45')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-e5d66acf-7845-4c74-a5a4-315e5b2ebb45 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-e5d66acf-7845-4c74-a5a4-315e5b2ebb45');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-828d1b5a-cf35-427a-8136-5f3779b049aa\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-828d1b5a-cf35-427a-8136-5f3779b049aa')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-828d1b5a-cf35-427a-8136-5f3779b049aa button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                             writers  \\\n",
              "0  [Charles W. Goddard (screenplay), Basil Dickey...   \n",
              "1                             [H.M. Walker (titles)]   \n",
              "2  [Herbert Brenon (adaptation), John Russell (ad...   \n",
              "3  [Douglas Fairbanks (story), Jack Cunningham (a...   \n",
              "4  [Ted Wilde (story), John Grey (story), Clyde B...   \n",
              "\n",
              "                                                cast  \\\n",
              "0  [Pearl White, Crane Wilbur, Paul Panzer, Edwar...   \n",
              "1  [Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...   \n",
              "2  [Ronald Colman, Neil Hamilton, Ralph Forbes, A...   \n",
              "3  [Billie Dove, Tempe Pigott, Donald Crisp, Sam ...   \n",
              "4  [Harold Lloyd, Jobyna Ralston, Noah Young, Jim...   \n",
              "\n",
              "                                                plot countries  \\\n",
              "0  Young Pauline is left a lot of money when her ...     [USA]   \n",
              "1  A penniless young man tries to save an heiress...     [USA]   \n",
              "2  Michael \"Beau\" Geste leaves England in disgrac...     [USA]   \n",
              "3  Seeking revenge, an athletic young man joins t...     [USA]   \n",
              "4  An irresponsible young millionaire changes his...     [USA]   \n",
              "\n",
              "                              directors  \\\n",
              "0  [Louis J. Gasnier, Donald MacKenzie]   \n",
              "1       [Alfred J. Goulding, Hal Roach]   \n",
              "2                      [Herbert Brenon]   \n",
              "3                       [Albert Parker]   \n",
              "4                          [Sam Taylor]   \n",
              "\n",
              "                                              poster  \\\n",
              "0  https://m.media-amazon.com/images/M/MV5BMzgxOD...   \n",
              "1  https://m.media-amazon.com/images/M/MV5BNzE1OW...   \n",
              "2                                               None   \n",
              "3  https://m.media-amazon.com/images/M/MV5BMzU0ND...   \n",
              "4  https://m.media-amazon.com/images/M/MV5BMTcxMT...   \n",
              "\n",
              "                       genres                                         imdb  \\\n",
              "0                    [Action]    {'id': 4465, 'rating': 7.6, 'votes': 744}   \n",
              "1     [Comedy, Short, Action]   {'id': 10146, 'rating': 7.0, 'votes': 639}   \n",
              "2  [Action, Adventure, Drama]   {'id': 16634, 'rating': 6.9, 'votes': 222}   \n",
              "3         [Adventure, Action]  {'id': 16654, 'rating': 7.2, 'votes': 1146}   \n",
              "4   [Action, Comedy, Romance]   {'id': 16895, 'rating': 7.6, 'votes': 918}   \n",
              "\n",
              "   num_mflix_comments  runtime  \\\n",
              "0                   0    199.0   \n",
              "1                   0     22.0   \n",
              "2                   0    101.0   \n",
              "3                   1     88.0   \n",
              "4                   0     58.0   \n",
              "\n",
              "                                            fullplot  languages  \\\n",
              "0  Young Pauline is left a lot of money when her ...  [English]   \n",
              "1  As a penniless man worries about how he will m...  [English]   \n",
              "2  Michael \"Beau\" Geste leaves England in disgrac...  [English]   \n",
              "3  A nobleman vows to avenge the death of his fat...       None   \n",
              "4  The Uptown Boy, J. Harold Manners (Lloyd) is a...  [English]   \n",
              "\n",
              "                   title                                             awards  \\\n",
              "0  The Perils of Pauline    {'nominations': 0, 'text': '1 win.', 'wins': 1}   \n",
              "1     From Hand to Mouth  {'nominations': 1, 'text': '1 nomination.', 'w...   \n",
              "2             Beau Geste    {'nominations': 0, 'text': '1 win.', 'wins': 1}   \n",
              "3       The Black Pirate    {'nominations': 0, 'text': '1 win.', 'wins': 1}   \n",
              "4      For Heaven's Sake  {'nominations': 1, 'text': '1 nomination.', 'w...   \n",
              "\n",
              "    type   rated  metacritic  \\\n",
              "0  movie    None         NaN   \n",
              "1  movie    TV-G         NaN   \n",
              "2  movie    None         NaN   \n",
              "3  movie    None         NaN   \n",
              "4  movie  PASSED         NaN   \n",
              "\n",
              "                            plot_embedding_optimised  \n",
              "0  [0.015450738370418549, -0.0037871389649808407,...  \n",
              "1  [-0.024403288960456848, 0.009791935794055462, ...  \n",
              "2  [-0.0314427949488163, 0.07585323601961136, 0.0...  \n",
              "3  [0.021738531067967415, 0.06848915666341782, 0....  \n",
              "4  [0.008178990334272385, -0.019387762993574142, ...  "
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import openai\n",
        "from google.colab import userdata\n",
        "\n",
        "openai.api_key = userdata.get(\"open_ai\")\n",
        "\n",
        "EMBEDDING_MODEL = \"text-embedding-3-small\"\n",
        "\n",
        "\n",
        "def get_embedding(text):\n",
        "    \"\"\"Generate an embedding for the given text using OpenAI's API.\"\"\"\n",
        "\n",
        "    # Check for valid input\n",
        "    if not text or not isinstance(text, str):\n",
        "        return None\n",
        "\n",
        "    try:\n",
        "        # Call OpenAI API to get the embedding\n",
        "        embedding = (\n",
        "            openai.embeddings.create(input=text, model=EMBEDDING_MODEL)\n",
        "            .data[0]\n",
        "            .embedding\n",
        "        )\n",
        "        return embedding\n",
        "    except Exception as e:\n",
        "        print(f\"Error in get_embedding: {e}\")\n",
        "        return None\n",
        "\n",
        "\n",
        "dataset_df[\"plot_embedding_optimised\"] = dataset_df[\"plot\"].apply(get_embedding)\n",
        "\n",
        "dataset_df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7SdAbIlj__gL"
      },
      "source": [
        "## Step 5: Vector Database Setup and Data Ingestion\n",
        "\n",
        "MongoDB acts as both an operational and a vector database. It offers a database solution that efficiently stores, queries and retrieves vector embeddings—the advantages of this lie in the simplicity of database maintenance, management and cost.\n",
        "\n",
        "**To create a new MongoDB database, set up a database cluster:**\n",
        "\n",
        "1. Head over to MongoDB official site and register for a [free MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register), or for existing users, [sign into MongoDB Atlas](https://account.mongodb.com/account/login?nds=true).\n",
        "\n",
        "2. Select the 'Database' option on the left-hand pane, which will navigate to the Database Deployment page, where there is a deployment specification of any existing cluster. Create a new database cluster by clicking on the \"+Create\" button.\n",
        "\n",
        "3.   Select all the applicable configurations for the database cluster. Once all the configuration options are selected, click the “Create Cluster” button to deploy the newly created cluster. MongoDB also enables the creation of free clusters on the “Shared Tab”.\n",
        "\n",
        " *Note: Don’t forget to whitelist the IP for the Python host or 0.0.0.0/0 for any IP when creating proof of concepts.*\n",
        "\n",
        "4. After successfully creating and deploying the cluster, the cluster becomes accessible on the ‘Database Deployment’ page.\n",
        "\n",
        "5. Click on the “Connect” button of the cluster to view the option to set up a connection to the cluster via various language drivers.\n",
        "\n",
        "6. This tutorial only requires the cluster's URI(unique resource identifier). Grab the URI and copy it into the Google Colabs Secrets environment in a variable named `MONGO_URI` or place it in a .env file or equivalent.\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vrj2zgPWdFGA",
        "outputId": "246dce04-a13e-4c34-f15d-21097a6ea8ee"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Connection to MongoDB successful\n"
          ]
        }
      ],
      "source": [
        "import pymongo\n",
        "from google.colab import userdata\n",
        "\n",
        "\n",
        "def get_mongo_client(mongo_uri):\n",
        "    \"\"\"Establish connection to the MongoDB.\"\"\"\n",
        "    try:\n",
        "        client = pymongo.MongoClient(\n",
        "            mongo_uri, appname=\"devrel.showcase.rag_openai_text_embedding_3\"\n",
        "        )\n",
        "        print(\"Connection to MongoDB successful\")\n",
        "        return client\n",
        "    except pymongo.errors.ConnectionFailure as e:\n",
        "        print(f\"Connection failed: {e}\")\n",
        "        return None\n",
        "\n",
        "\n",
        "mongo_uri = userdata.get(\"MONGO_URI_2\")\n",
        "if not mongo_uri:\n",
        "    print(\"MONGO_URI not set in environment variables\")\n",
        "\n",
        "mongo_client = get_mongo_client(mongo_uri)\n",
        "\n",
        "# Ingest data into MongoDB\n",
        "db = mongo_client[\"movies\"]\n",
        "collection = db[\"movie_collection\"]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "8WMQTNfG0xnC",
        "outputId": "b4d8b2ae-afef-4fb3-df1c-1085befd3b8d"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "DeleteResult({'n': 2946, 'electionId': ObjectId('7fffffff0000000000000002'), 'opTime': {'ts': Timestamp(1706777337, 131), 't': 2}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1706777337, 136), 'signature': {'hash': b'I}`\\x92v\\x00\\n\\x1e\\x00_\\x13}\\x875O\\xa1[\\xb4\\xf6\\x18', 'keyId': 7330233208207835141}}, 'operationTime': Timestamp(1706777337, 131)}, acknowledged=True)"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# Delete any existing records in the collection\n",
        "collection.delete_many({})"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "czc3wOnDz7dm",
        "outputId": "6bc320f7-d0cc-4be8-9b2b-a3afc4eedf81"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Data ingestion into MongoDB completed\n"
          ]
        }
      ],
      "source": [
        "documents = dataset_df.to_dict(\"records\")\n",
        "collection.insert_many(documents)\n",
        "\n",
        "print(\"Data ingestion into MongoDB completed\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WRp0qGSkp4tx"
      },
      "source": [
        "## Step 6: Create a Vector Search Index\n",
        "\n",
        "At this point make sure that your vector index is created via MongoDB Atlas.\n",
        "Follow instructions here:\n",
        "\n",
        "This next step is mandatory for conducting efficient and accurate vector-based searches based on the vector embeddings stored within the documents in the ‘movie_collection’ collection. Creating a Vector Search Index enables the ability to traverse the documents efficiently to retrieve documents with embeddings that match the query embedding based on vector similarity. Go here to read more about [MongoDB Vector Search Index](https://www.mongodb.com/docs/atlas/atlas-search/field-types/knn-vector/).\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "41EnDZx5B0Hj"
      },
      "source": [
        "## Step 7: Perform Vector Search on User Queries\n",
        "\n",
        "This step combines all the activities in the previous step to provide the functionality of conducting vector search on stored records based on embedded user queries.\n",
        "\n",
        "This step implements a function that returns a vector search result by generating a query embedding and defining a MongoDB aggregation pipeline. The pipeline, consisting of the `$vectorSearch` and `$project` stages, queries using the generated vector and formats the results to include only required information like plot, title, and genres while incorporating a search score for each result.\n",
        "\n",
        "This selective projection enhances query performance by reducing data transfer and optimizes the use of network and memory resources, which is especially critical when handling large datasets. For AI Engineers and Developers considering data security at an early stage, the chances of sensitive data leaked to the client side can be minimized by carefully excluding fields irrelevant to the user's query.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {
        "id": "YQXL69LtvpVQ"
      },
      "outputs": [],
      "source": [
        "def vector_search(user_query, collection):\n",
        "    \"\"\"\n",
        "    Perform a vector search in the MongoDB collection based on the user query.\n",
        "\n",
        "    Args:\n",
        "    user_query (str): The user's query string.\n",
        "    collection (MongoCollection): The MongoDB collection to search.\n",
        "\n",
        "    Returns:\n",
        "    list: A list of matching documents.\n",
        "    \"\"\"\n",
        "\n",
        "    # Generate embedding for the user query\n",
        "    query_embedding = get_embedding(user_query)\n",
        "\n",
        "    if query_embedding is None:\n",
        "        return \"Invalid query or embedding generation failed.\"\n",
        "\n",
        "    # Define the vector search pipeline\n",
        "    pipeline = [\n",
        "        {\n",
        "            \"$vectorSearch\": {\n",
        "                \"index\": \"vector_index\",\n",
        "                \"queryVector\": query_embedding,\n",
        "                \"path\": \"plot_embedding_optimised\",\n",
        "                \"numCandidates\": 150,  # Number of candidate matches to consider\n",
        "                \"limit\": 5,  # Return top 5 matches\n",
        "            }\n",
        "        },\n",
        "        {\n",
        "            \"$project\": {\n",
        "                \"_id\": 0,  # Exclude the _id field\n",
        "                \"plot\": 1,  # Include the plot field\n",
        "                \"title\": 1,  # Include the title field\n",
        "                \"genres\": 1,  # Include the genres field\n",
        "                \"score\": {\"$meta\": \"vectorSearchScore\"},  # Include the search score\n",
        "            }\n",
        "        },\n",
        "    ]\n",
        "\n",
        "    # Execute the search\n",
        "    results = collection.aggregate(pipeline)\n",
        "    return list(results)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2Uu3EqLvCHce"
      },
      "source": [
        "## Step 8: Handling User Query and Result\n",
        "\n",
        "The final step in the implementation phase focuses on the practical application of our vector search functionality and AI integration to handle user queries effectively.\n",
        "\n",
        "The handle_user_query function performs a vector search on the MongoDB collection based on the user's query and utilizes OpenAI's GPT-3.5 model to generate context-aware responses.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {
        "id": "Oq_-i3fq1NbN"
      },
      "outputs": [],
      "source": [
        "def handle_user_query(query, collection):\n",
        "    get_knowledge = vector_search(query, collection)\n",
        "\n",
        "    search_result = \"\"\n",
        "    for result in get_knowledge:\n",
        "        search_result += (\n",
        "            f\"Title: {result.get('title', 'N/A')}, Plot: {result.get('plot', 'N/A')}\\n\"\n",
        "        )\n",
        "\n",
        "    completion = openai.chat.completions.create(\n",
        "        model=\"gpt-3.5-turbo\",\n",
        "        messages=[\n",
        "            {\"role\": \"system\", \"content\": \"You are a movie recommendation system.\"},\n",
        "            {\n",
        "                \"role\": \"user\",\n",
        "                \"content\": \"Answer this user query: \"\n",
        "                + query\n",
        "                + \" with the following context: \"\n",
        "                + search_result,\n",
        "            },\n",
        "        ],\n",
        "    )\n",
        "\n",
        "    return (completion.choices[0].message.content), search_result"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "AXPiV6XodTS1",
        "outputId": "6f265561-a61c-4f10-ee94-8e6a10603492"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Response: Based on the given context, the best romantic movie recommendation would be \"Gorgeous\". This movie combines romance and the story of a girl searching for love with a kind-hearted professional fighter.\n",
            "Source Information: \n",
            "Title: Run, Plot: This action movie is filled with romance and adventure. As Abhisek fights for his life against the forces of crime and injustice, he meets Bhoomika, who captures his heart.\n",
            "Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.\n",
            "Title: Gorgeous, Plot: A romantic girl travels to Hong Kong in search of certain love but instead meets a kind-hearted professional fighter with whom she begins to fall for instead.\n",
            "Title: Once a Thief, Plot: A romantic and action packed story of three best friends, a group of high end art thieves, who come into trouble when a love-triangle forms between them.\n",
            "Title: House of Flying Daggers, Plot: A romantic police captain breaks a beautiful member of a rebel group out of prison to help her rejoin her fellows, but things are not what they seem.\n",
            "\n"
          ]
        }
      ],
      "source": [
        "# 6. Conduct query with retrival of sources\n",
        "query = \"What is the best romantic movie to watch?\"\n",
        "response, source_information = handle_user_query(query, collection)\n",
        "\n",
        "print(f\"Response: {response}\")\n",
        "print(f\"Source Information: \\n{source_information}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "id": "WhxxNwkHOEb-"
      },
      "outputs": [],
      "source": []
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
        "0d8907102d4a4beba76c057296d52bcf": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "12e551e39fd04cb489c67125ba4a9587": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "130cc31ff46f445e9f7e607f8d8aca45": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "13e7387663b840efbabbacc7b02ecbf1": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "15bf28adec2942c58e1101068abbe5cc": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "182fd07b7bd040c4a91cd8da4cd7062f": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "1a81f5d565ca479b85aa096a3a0db260": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "success",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_9c4d78299d064b9fb5d062b197be73f5",
            "max": 42271667,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_130cc31ff46f445e9f7e607f8d8aca45",
            "value": 42271667
          }
        },
        "26e5a2e0ef0745e7a1278272cbf478b1": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "2fe30671551b45719066821912e67cc6": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_13e7387663b840efbabbacc7b02ecbf1",
            "placeholder": "​",
            "style": "IPY_MODEL_6417cd7c833040d39e9cc4286eefe9aa",
            "value": "Downloading readme: 100%"
          }
        },
        "35072030d3d64381b1ad1920bdf7487d": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_26e5a2e0ef0745e7a1278272cbf478b1",
            "placeholder": "​",
            "style": "IPY_MODEL_895e63dd0a994b2a98fd2279f2fcd488",
            "value": " 42.3M/42.3M [00:02&lt;00:00, 17.7MB/s]"
          }
        },
        "39a1be31b49e4a42819a14f792952b9f": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "success",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_6b134ba7d60d48999284332650a85785",
            "max": 2707,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_55f8d507700c4d03b38d861148f24fea",
            "value": 2707
          }
        },
        "41c2a3273bf842848ed271fbfd450a63": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_d13f0c1006b44683bb475ae56a88cf67",
              "IPY_MODEL_df79edc3b5ed4454840acc5dbf11459d",
              "IPY_MODEL_681ac17d999f473b897d361a8b2d3082"
            ],
            "layout": "IPY_MODEL_ac331b35738c4557b6bbf3b38e901711"
          }
        },
        "41cf9f5ecbe1417da972dcdd6b765741": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_c0a118cd37284c45bf29b19c8f5bc8f5",
            "placeholder": "​",
            "style": "IPY_MODEL_e99edb4b89bc48e29c0086f8be068c14",
            "value": "Downloading data: 100%"
          }
        },
        "5218f9ff0f064fb0832ddb7308922082": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "55f8d507700c4d03b38d861148f24fea": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "6417cd7c833040d39e9cc4286eefe9aa": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "681ac17d999f473b897d361a8b2d3082": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_15bf28adec2942c58e1101068abbe5cc",
            "placeholder": "​",
            "style": "IPY_MODEL_b3937e60c4cc4831b6e18e65d6a2a4ce",
            "value": " 1500/0 [00:02&lt;00:00, 662.94 examples/s]"
          }
        },
        "6b134ba7d60d48999284332650a85785": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "6ff3a4b06afd46018f7c595b575ecd1d": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "70ae19241fab4ac295a39c4499910a97": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "7d1087206fb5440eaa64f802a976029d": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": "20px"
          }
        },
        "895e63dd0a994b2a98fd2279f2fcd488": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "9c4d78299d064b9fb5d062b197be73f5": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "ac331b35738c4557b6bbf3b38e901711": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "b3937e60c4cc4831b6e18e65d6a2a4ce": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "b956bc06fe3849deaa765f8ec4890635": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_41cf9f5ecbe1417da972dcdd6b765741",
              "IPY_MODEL_1a81f5d565ca479b85aa096a3a0db260",
              "IPY_MODEL_35072030d3d64381b1ad1920bdf7487d"
            ],
            "layout": "IPY_MODEL_12e551e39fd04cb489c67125ba4a9587"
          }
        },
        "bffa8ac928a94aa19b3506327879ee05": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_182fd07b7bd040c4a91cd8da4cd7062f",
            "placeholder": "​",
            "style": "IPY_MODEL_5218f9ff0f064fb0832ddb7308922082",
            "value": " 2.71k/2.71k [00:00&lt;00:00, 125kB/s]"
          }
        },
        "c0a118cd37284c45bf29b19c8f5bc8f5": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "d068109b4d3e44639b83487d81894bb9": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "d13f0c1006b44683bb475ae56a88cf67": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_d068109b4d3e44639b83487d81894bb9",
            "placeholder": "​",
            "style": "IPY_MODEL_70ae19241fab4ac295a39c4499910a97",
            "value": "Generating train split: "
          }
        },
        "df79edc3b5ed4454840acc5dbf11459d": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "success",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_7d1087206fb5440eaa64f802a976029d",
            "max": 1,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_6ff3a4b06afd46018f7c595b575ecd1d",
            "value": 1
          }
        },
        "e99edb4b89bc48e29c0086f8be068c14": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "ea04b37a21324337bec7cb20dea86f56": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_2fe30671551b45719066821912e67cc6",
              "IPY_MODEL_39a1be31b49e4a42819a14f792952b9f",
              "IPY_MODEL_bffa8ac928a94aa19b3506327879ee05"
            ],
            "layout": "IPY_MODEL_0d8907102d4a4beba76c057296d52bcf"
          }
        },
        "state": {}
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
